Produced while I was an affiliate at PIBBSS[1]. The work was initially funded by a Lightspeed Grant, and then continued while I was at PIBBSS. Work done in collaboration with @Paul Riechers, @Lucas Teixeira, @Alexander Gietelink Oldenziel, and Sarah Marzen. Paul was a MATS scholar during part of this work. Thanks to Paul, Lucas, Alexander, Sarah, and @Guillaume Corlouer for suggestions on this writeup.
What computational structure are we building into LLMs when we train them on next-token prediction? In this post we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. We'll explain exactly what this means in the post. We are excited by these results because...
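To make "belief updating over hidden states of the data-generating process" concrete, here is a minimal sketch (not from the post) of Bayesian filtering over the hidden states of a toy hidden Markov model. All the transition and emission probabilities are invented for illustration; the point is just that a token sequence induces a trajectory of belief states, and those meta-dynamics are the structure in question.

```python
import numpy as np

# Toy 2-state HMM standing in for a data-generating process.
# All numbers are invented for illustration.
T = np.array([[0.9, 0.1],    # T[i, j] = P(next hidden state j | state i)
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],    # E[j, k] = P(emit token k | hidden state j)
              [0.1, 0.9]])

def update_belief(belief, token):
    """One step of Bayesian belief updating after observing `token`."""
    predicted = belief @ T               # propagate through hidden dynamics
    unnorm = predicted * E[:, token]     # condition on the observed token
    return unnorm / unnorm.sum()

# The "meta-dynamics": each observed token moves the belief state, so the
# belief states themselves form a dynamical system driven by the data.
belief = np.array([0.5, 0.5])
for token in [0, 0, 1, 1, 0]:
    belief = update_belief(belief, token)
    print(belief)
```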
Promoted to curated: Formalizing what it means for transformers to learn "the underlying world model" when engaging in next-token prediction tasks seems pretty useful, in that it's an abstraction that I see used all the time when discussing risks from models where the vast majority of the compute was spent in pre-training, where the details usually get handwaved. It seems useful to understand what exactly we mean by that in more detail.
I have not done a thorough review of this kind of work, but it seems to me that others also thought the basic ideas in the work hold up, and reading this post gave me crisper abstractions for talking about this kind of work in the future.
If the transcoders are used to predict next tokens, they may lose interpretability
Possibly. But there is no optimization pressure from pre-training on the relationship between MLPs and transcoders. The MLPs are what pre-training optimizes (as the "full-precision" master model), while the transcoders only need to be kept in sync with the MLPs, whatever they are (according to the same local objective as before, which doesn't care at all about token prediction). The search is for MLPs such that their transcoders are good predictors, n...
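For concreteness, here is a minimal sketch of the "local objective" described above, as I understand it: the transcoder is fit purely to reconstruct the MLP's input-to-output map (plus a sparsity penalty), with no gradient from the token-prediction loss anywhere in the step. All sizes, names, and the sparsity coefficient are invented for illustration.

```python
import torch
import torch.nn as nn

d_model, d_code = 64, 512  # illustrative sizes

# The MLP is what pre-training optimizes (the "full-precision" master model).
mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                    nn.Linear(4 * d_model, d_model))

# The transcoder: a wide, sparse stand-in trained only to imitate the MLP.
encode = nn.Linear(d_model, d_code)
decode = nn.Linear(d_code, d_model)
opt = torch.optim.Adam(list(encode.parameters()) + list(decode.parameters()),
                       lr=1e-3)

def sync_step(acts):
    """One step of the local objective: track whatever the MLP currently is.

    No token-prediction loss appears here; the transcoder is optimized only
    to stay in sync with the (frozen, w.r.t. this loss) MLP.
    """
    target = mlp(acts).detach()          # MLP gets no gradient from this loss
    code = torch.relu(encode(acts))      # transcoder's sparse code
    recon = decode(code)
    loss = ((recon - target) ** 2).mean() + 1e-3 * code.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

acts = torch.randn(32, d_model)          # stand-in residual-stream activations
sync_step(acts)
```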
Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout).
TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations that override safety training, elicit backdoored behaviors, and uncover latent capabilities.
Summary: In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given...
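To make the objective concrete, here is a hedged, simplified sketch of learning one unsupervised steering vector: a vector added as a bias to a source layer's MLP output, trained to maximize the change it induces in a downstream layer's activations, under a fixed-norm constraint so the objective stays bounded. The `model.layers[i]` / `model.config.hidden_size` interface and all hyperparameters are assumptions for illustration, not the post's actual code.

```python
import torch

def learn_steering_vector(model, prompt_ids, src_layer, tgt_layer,
                          radius=8.0, steps=200, lr=1e-2):
    """Simplified MELBO-style sketch: learn one unsupervised steering vector.

    `theta` is added as a bias to the MLP output at `src_layer` and trained
    to maximize the change it causes in `tgt_layer` activations, while its
    norm is held fixed at `radius`. Assumes `model.layers[i]` are modules
    whose forward outputs are plain tensors.
    """
    theta = torch.randn(model.config.hidden_size)
    theta.requires_grad_(True)

    # Record the unsteered target-layer activations once.
    baseline = []
    h = model.layers[tgt_layer].register_forward_hook(
        lambda m, i, o: baseline.append(o.detach()))
    model(prompt_ids)
    h.remove()

    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        steered = []
        hooks = [
            model.layers[src_layer].mlp.register_forward_hook(
                lambda m, i, o: o + theta),          # bias on the MLP output
            model.layers[tgt_layer].register_forward_hook(
                lambda m, i, o: steered.append(o)),
        ]
        model(prompt_ids)
        for h in hooks:
            h.remove()
        loss = -((steered[0] - baseline[0]) ** 2).sum()  # maximize the change
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                 # project back onto the sphere
            theta *= radius / theta.norm()
    return theta.detach()
```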
I would definitely like to see quantification of the degree to which MELBO elicits natural, preexisting behaviors. One challenge in the literature is: you might hope to see if a network "knows" a fact by optimizing a prompt input to produce that fact as an output. However, even randomly initialized networks can be made to output those facts, so "just optimize an embedded prompt using gradient descent" is too expressive.
One of my hopes here is that the large majority of the steered behaviors are in fact natural. One reason for hope is that we aren't o...
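As a concrete instance of the pitfall in the parent comment: "optimize an embedded prompt" means treating the prompt's embedding vectors as free continuous parameters and descending on the loss of the desired output. Because this parameterization is so expressive, it can often succeed even on a randomly initialized network, which is why success is weak evidence that the model "knows" the fact. A minimal sketch, assuming a HuggingFace-style causal LM that accepts `inputs_embeds`:

```python
import torch

def optimize_prompt_embedding(model, target_ids, prompt_len=5,
                              steps=500, lr=1e-2):
    """Gradient-descend on a soft prompt until the model emits `target_ids`.

    The soft prompt is a free continuous parameter, so this procedure is
    expressive enough to work even on a randomly initialized network.
    """
    d_model = model.config.hidden_size
    soft_prompt = torch.randn(1, prompt_len, d_model, requires_grad=True)
    opt = torch.optim.Adam([soft_prompt], lr=lr)
    target_embeds = model.get_input_embeddings()(target_ids).detach()

    for _ in range(steps):
        inputs = torch.cat([soft_prompt, target_embeds], dim=1)
        logits = model(inputs_embeds=inputs).logits
        # Positions prompt_len-1 .. end-1 are the ones that predict the
        # target tokens; score them against the targets.
        pred = logits[:, prompt_len - 1 : -1, :]
        loss = torch.nn.functional.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return soft_prompt.detach()
```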
[For background & spelling out the acronyms in the title, see: Debate (AI safety technique), Recursive Reward Modeling, Eliciting Latent Knowledge, Natural Abstractions.]
When I say “Why I’m not working on X”, I am NOT trying to find a polite & diplomatic way to say “Nobody should work on X because X is unhelpful for AGI safety”. Hmm, OK, well, maybe it’s just a little bit that. But really, I don’t feel strongly. Instead, I think:
Traditionally, those techniques are focused on what the model is outputting, not what the model’s underlying motivations are. But I haven’t read all the literature. Am I missing something?
It's confusing to me as well, perhaps because different people (or even the same person at different times) emphasize different things within the same approach, but here's one post where someone said, "It is important that the overseer both knows which action the distilled AI wants to take as well as why it takes that action."
Top labs use various forms of “safety training” on models before their release to make sure they don’t do nasty stuff - but how robust is that? How can we ensure that the weights of powerful AIs don’t get leaked or stolen? And what can AI even do these days? In this episode, I speak with Jeffrey Ladish about security and AI.
You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don't know how to actually pin down these hypotheticals.